# Display output within the notebook
# knitr::opts_chunk$set(echo = TRUE, results = 'show')
For more on regression, see LSR.
Using X to predict Y
Is there an association between X and Y?
A scatterplot visualizes the relationship between two quantitative variables on the same individuals.
age <- c(1, 2, 2, 2, 2, 3, 4, 4, 5, 5)
cost <- c(350, 370, 480, 520, 590, 550, 750, 800, 790, 950)
plot(age, cost, pch = 16, las = 1, cex = 1, xlab = "Age of Bus (years)", ylab = "Annual Maintenance Cost ($)")
# From Camm et al., Business Analytics, 3e, p. 356
# A *time series* is a scatterplot with time (seconds, months, years, decades) on the x-axis (as the explanatory variable).
Outcome = Model + Error
Avoid “overfitting”; don’t play connect the dots!
Introductory, baseline ML models
| Model | Data |
|---|---|
| Linear regression | One predictor with a continuous outcome |
| Multiple regression | Multiple predictors with a continuous outcome |
| Logistic regression | One predictor with a binary outcome |
| Multiple logistic regression | Multiple predictors with a binary outcome |
| Polynomial regression | Relationship between the predictor(s) and the continuous outcome is nonlinear |
| …and more… | |
A response variable (y) measures an outcome of a study.
An explanatory variable (x) may help predict or explain changes in a response variable.
If knowing the value of one variable helps us predict the value of the other, there is an association between the two variables.
Regressing Y on X
# mtcars is built-in data within R
plot(mtcars$hp, mtcars$mpg, pch = 16, las = 1, cex = 1, main = "Terminology Practice", xlab = "Horsepower", ylab = "Miles Per Gallon")
The dataset comes from a national veterans’ organization that frequently solicits donations through direct mail campaigns to its database of current and prospective donors. The organization sent out a test mailing to a group of potential donors and gathered information on the response to that test mailing.
From Practical Machine Learning with R by Nwanganga and Chapple (p. 166-68)
Download the donors.csv and this Rmd notebook from
Brightspace.
Remember to save the csv in the same folder as the Rmd notebook; otherwise, you will need to set your working directory. See previous module for video assistance.
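If `read.csv()` can't find the file, a quick sanity check (a sketch; it only reports what R can see, it doesn't fix anything):

```r
# Is donors.csv visible from R's current working directory?
file.exists("donors.csv")   # TRUE if R can see the file
getwd()                     # the folder R is currently looking in
# If FALSE, use setwd() (or the RStudio Session menu) to point R at the csv's folder
```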
# Load the data as "donors"
donors <- read.csv("donors.csv")
summary(donors)
age numberChildren incomeRating wealthRating mailOrderPurchases totalGivingAmount numberGifts smallestGiftAmount
Min. : 1.00 Min. :1.000 Min. :1.000 Min. :0.000 Min. : 0.000 Min. : 13.0 Min. : 1.000 Min. : 0.000
1st Qu.:48.00 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:3.000 1st Qu.: 0.000 1st Qu.: 40.0 1st Qu.: 3.000 1st Qu.: 3.000
Median :62.00 Median :1.000 Median :4.000 Median :6.000 Median : 0.000 Median : 78.0 Median : 7.000 Median : 5.000
Mean :61.61 Mean :1.528 Mean :3.886 Mean :5.346 Mean : 3.321 Mean : 104.5 Mean : 9.602 Mean : 7.934
3rd Qu.:75.00 3rd Qu.:2.000 3rd Qu.:5.000 3rd Qu.:8.000 3rd Qu.: 3.000 3rd Qu.: 131.0 3rd Qu.: 13.000 3rd Qu.: 10.000
Max. :98.00 Max. :7.000 Max. :7.000 Max. :9.000 Max. :241.000 Max. :9485.0 Max. :237.000 Max. :1000.000
NA's :23665 NA's :83026 NA's :21286 NA's :44732
largestGiftAmount averageGiftAmount yearsSinceFirstDonation monthsSinceLastDonation inHouseDonor plannedGivingDonor sweepstakesDonor
Min. : 5 Min. : 1.286 Min. : 0.000 Min. : 0.00 Mode :logical Mode :logical Mode :logical
1st Qu.: 14 1st Qu.: 8.385 1st Qu.: 2.000 1st Qu.:12.00 FALSE:88709 FALSE:95298 FALSE:93795
Median : 17 Median : 11.636 Median : 5.000 Median :14.00 TRUE :6703 TRUE :114 TRUE :1617
Mean : 20 Mean : 13.348 Mean : 5.596 Mean :14.36
3rd Qu.: 23 3rd Qu.: 15.478 3rd Qu.: 9.000 3rd Qu.:17.00
Max. :5000 Max. :1000.000 Max. :13.000 Max. :23.00
P3Donor state urbanicity socioEconomicStatus isHomeowner gender respondedMailing
Mode :logical Length:95412 Length:95412 Length:95412 Mode:logical Length:95412 Mode :logical
FALSE:93395 Class :character Class :character Class :character TRUE:52354 Class :character FALSE:90569
TRUE :2017 Mode :character Mode :character Mode :character NA's:43058 Mode :character TRUE :4843
# Note some weird features!
# See ages, numberChildren, lots of NAs
# We need a codebook!
Sample: 50 donors (n = 50)
Population: All donors (N = 95,412)
# For reproducibility
set.seed(1)
# Randomly select 50 numbers to represent rows in dataframe
sampleIDs <- sample(1:95412, size = 50, replace = FALSE)
# Select the columns I'm interested in
myColumns <- c("age", "incomeRating", "totalGivingAmount", "largestGiftAmount", "isHomeowner", "gender", "urbanicity", "respondedMailing")
# Store the randomly selected rows on the given columns
sampleDonors <- donors[sampleIDs, myColumns]
sampleDonors
With NAs
cor(sampleDonors[1:50, na.rm=FALSE, c("age", "incomeRating", "totalGivingAmount", "largestGiftAmount")])
Error in `[.data.frame`(sampleDonors, 1:50, na.rm = FALSE, c("age", "incomeRating", :
  unused argument (na.rm = FALSE)
?cor
Removing NAs
round(cor(sampleDonors[1:50, c("age", "incomeRating", "totalGivingAmount", "largestGiftAmount")], use = "complete.obs"), 2)
age incomeRating totalGivingAmount largestGiftAmount
age 1.00 -0.29 0.13 -0.20
incomeRating -0.29 1.00 0.11 0.32
totalGivingAmount 0.13 0.11 1.00 0.54
largestGiftAmount -0.20 0.32 0.54 1.00
plot(sampleDonors$age,
sampleDonors$incomeRating,
cex = 0.5,
las = 1,
pch = 16,
xlab = "Age",
ylab = "Income Rating")
*Be sure you finished the task in the prior video!
outcome = model + error
giftCor <- round(cor(sampleDonors$largestGiftAmount, sampleDonors$totalGivingAmount), 2)
plot(sampleDonors$largestGiftAmount,
sampleDonors$totalGivingAmount,
cex = 1.5,
las = 1,
pch = 16,
xlab = "Largest Gift Amount",
ylab = "Total Giving",
main = paste("r = ", giftCor))
Review: Describe the relationship between the largest gift amount and total giving for this sample of 50 donors.
“Trend line”
“Line of Best Fit”
“Least Squares Regression Line” (LSRL)
“Ordinary Least Squares Regression” (OLSR)
lm(y ~ x) = linear model predicting y given x
# Calculate coefficients of best fitting linear model
model <- lm(sampleDonors$totalGivingAmount ~ sampleDonors$largestGiftAmount)
## Response (y) comes first, then the predictor (x)
## 51.82 is the intercept coefficient: where the line crosses the y-axis
## 3.91 is the slope of the line
model
Call:
lm(formula = sampleDonors$totalGivingAmount ~ sampleDonors$largestGiftAmount)
Coefficients:
(Intercept) sampleDonors$largestGiftAmount
51.82 3.91
Intercept is the y-intercept, where the best-fitting line hits the y-axis
Slope is the “incline” of the best-fitting line
“Slope-Intercept Form”
\[ y = mx + b \]
m is the slope b is the y-intercept
In statistics, we write it differently: \[ \widehat{y} = a + b(x) \\ \widehat{y} = a + bx \\ \widehat{y} = b_0 + b_1x \]
The latter form makes the most sense, given multiple regression (when we have multiple coefficients, not just two).
\(b_0\) and \(b_1\) are parameters (also as \(\theta_0\) and \(\theta_1\) or \(\beta_0\) and \(\beta_1\))
Why the hat \(\hat{y}\)? Why is it important?
Outcome = Model + Error
\[ y = a + bx + \epsilon \] or \[ \hat{y} = a + bx \]
It’s a trend, not a guarantee; not “deterministic”, not “exact”
lm(y ~ x)$coefficients[1] to access the y-intercept
lm(y ~ x)$coefficients[2] to access the slope
# Store the coefficients on the linear model
my_intercept <- round(model$coefficients[1], 1)
my_slope <- round(model$coefficients[2], 1)
# Plot scatterplot
plot(sampleDonors$largestGiftAmount,
sampleDonors$totalGivingAmount,
cex = 1.5,
las = 1,
pch = 16,
xlab = "Largest Gift Amount",
ylab = "Total Giving",
main = paste("LSRL: ", "Predicted y = ", my_intercept, "+", my_slope, "(x)"))
# Plot LSRL
abline(model,
lwd = 2, # make the line wider
col = "red" # color the line
)
Before we can understand what makes the LSRL the “best fitting” line, we need to understand what a residual is.
Let’s consider the scatterplot:
# Store the coefficients on the linear model
my_intercept <- round(model$coefficients[1], 1)
my_slope <- round(model$coefficients[2], 1)
# Plot scatterplot
plot(sampleDonors$largestGiftAmount,
sampleDonors$totalGivingAmount,
cex = 1.5,
las = 1,
pch = 16,
xlab = "Largest Gift Amount",
ylab = "Total Giving",
main = paste("LSRL: ", "Predicted y = ", my_intercept, "+", my_slope, "(x)"))
# Plot LSRL
abline(model,
lwd = 2, # make the line wider
col = "red" # color the line
)
Based on the LSRL, which donors gave more than expected based on their largest gift amount?
Based on the LSRL, which donors gave less than expected based on their largest gift amount?
A residual is the difference between the actual value of y and the value of y predicted by the regression line.
Residual = actual y - predicted y
or
Residual = observed y - expected y
or
\[ \epsilon = y - \hat{y} \]
A positive residual means…
A negative residual means…
# Create a simple dataset
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10) # Perfectly linear relationship (y = 2x)
# Introduce one positive residual and one negative residual
y[2] <- y[2] + 2 # Positive residual
y[4] <- y[4] - 2 # Negative residual
# Fit a linear model
ex_model <- lm(y ~ x)
# Plot the data and the fitted line
plot(x, y, pch = 16, cex = 1.5, las = 1, col = "black", main = "Positive & Negative Residuals")
abline(ex_model, col = "blue", lwd = 3)
# Add residual lines
for (i in 1:length(x)) {
lines(c(x[i], x[i]), c(y[i], predict(ex_model)[i]), col = "red", lty = 2, lwd = 2.5)
}
“the line underpredicts when…”
“the line overpredicts when…”
If a residual is close to (or equals) zero, that means…
The sum of all the residuals = __.
Note: A residual is the vertical distance from a point to the LSRL
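You can check the fill-in-the-blank above numerically: for any least-squares line with an intercept, the residuals sum to zero (up to floating-point rounding). A minimal sketch with toy data, not the donors sample:

```r
# Residuals of a least-squares fit (with an intercept) always sum to ~0
x <- c(1, 2, 3, 4, 5)
y <- c(2, 6, 6, 6, 10)          # toy data with scatter around a line
fit <- lm(y ~ x)
round(sum(fit$residuals), 10)   # zero, up to floating-point rounding
```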
# Find the donor who gave $150 as largest gift amount
sampleDonors[sampleDonors$largestGiftAmount == 150, ]
# Alternate approach
which(sampleDonors$largestGiftAmount == 150)
[1] 46
### Note: 46 is the row position within sampleDonors; it does not match the original donors row name (13602)
# Pick select columns
sampleDonors[sampleDonors$largestGiftAmount == 150, c("largestGiftAmount", "totalGivingAmount")]
190 - (51.8 + 3.9* (150))
[1] -446.8
Interpretation: For this donor, whose largest gift was $150, the LSRL overpredicted total giving by about $446.80.
Or: this donor gave about $446.80 less than predicted.
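R can do this residual arithmetic for you via predict() and the residuals stored on the model. A sketch using the built-in mtcars data (so it runs without the donors file); the car chosen is just an example:

```r
# Residual by hand vs. the residual stored in the model (built-in mtcars data)
fit <- lm(mpg ~ hp, data = mtcars)
actual    <- mtcars["Valiant", "mpg"]             # observed y for the Valiant
predicted <- predict(fit, data.frame(hp = 105))   # the Valiant's hp is 105
actual - predicted           # residual = observed y - predicted y
fit$residuals["Valiant"]     # the same value, stored on the fitted model
```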
Find the donor in this sample whose largest gift amount was $100, and then calculate and interpret the residual for this donor.
which(sampleDonors$largestGiftAmount==100)
[1] 5
sampleDonors[sampleDonors$largestGiftAmount == 100, c("largestGiftAmount", "totalGivingAmount")]
#which(sampleDonors$largestGiftAmount == 150)
#190 - (51.8 + 3.9* (150))
921 - (51.8 + 3.9*(100))
[1] 479.2
Interpretation: This donor, whose largest gift was $100, gave about $479 more than predicted; the LSRL underpredicted their total giving (a positive residual).
lm(y ~ x)$residuals
# List all residuals
model$residuals
1 2 3 4 5 6 7 8 9 10
-41.18416812 -147.29733501 -41.91299372 -107.38203262 478.22223837 -30.00907904 -66.91299372 14.08700628 -58.91868757 139.49092096
11 12 13 14 15 16 17 18 19 20
-45.20124969 -64.00907904 2.08700628 -95.46103638 -25.65320703 -0.00907904 -124.55712170 -88.37064491 9.63504895 -83.00907904
21 22 23 24 25 26 27 28 29 30
-36.73221078 -29.55712170 23.63504895 -78.55712170 -58.91299372 132.99092096 165.03896362 -27.46103638 61.44287830 140.44287830
31 32 33 34 35 36 37 38 39 40
97.71974656 296.08971117 45.26778922 300.41440901 -43.73221078 -62.28025344 -62.28025344 58.99661482 246.71974656 -73.46103638
41 42 43 44 45 46 47 48 49 50
-1.27455959 65.70266499 -124.55712170 -51.09377665 32.08700628 -448.25818826 -50.00907904 -20.18986197 -59.91299372 -61.91299372
# Order the residuals
#sort(model$residuals)
# Create linear regression model
model <- lm(sampleDonors$totalGivingAmount ~ sampleDonors$largestGiftAmount)
# Plot scatterplot
plot(sampleDonors$largestGiftAmount,
sampleDonors$totalGivingAmount,
cex = 1.5,
las = 1,
pch = 16,
xlab = "Largest Gift Amount",
ylab = "Total Giving",
main = paste("LSRL: ", "y-hat = ", my_intercept, "+", my_slope, "(x)"))
# Plot LSRL
abline(model,
lwd = 2, # make the line wider
col = "red" # color the line
)
# Plot each residual from y to predicted y (i.e. fitted value) with segments()
## segments(x1, y1, x2, y2)
segments(sampleDonors$largestGiftAmount, # x1
fitted(model), # y1
sampleDonors$largestGiftAmount, # x2
sampleDonors$totalGivingAmount, # y2
col = "blue",
lty = 2)
We can plot the explanatory variable (or the predictions) vs. the residuals on a plot called a residual plot. More on residual plots later.
# Plot scatterplot with LSRL
plot(sampleDonors$largestGiftAmount,
sampleDonors$totalGivingAmount,
cex = 1.5,
las = 1,
pch = 16,
xlab = "Largest Gift Amount",
ylab = "Total Giving",
main = "Scatterplot")
abline(model,
lwd = 2, # make the line wider
col = "red" # color the line
)
# Residual Plot
plot(sampleDonors$largestGiftAmount,
model$residuals,
xlab = "Largest Gift Amount",
ylab = "Residual",
las = 1,
pch = 16,
cex = 1.5,
main = "Residual Plot")
abline(h = 0, col = "red", lwd = 2)
R comes with a built-in residual plot (which includes other diagnostic graphs).
plot(model)
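In an interactive session, plot(model) shows its diagnostic plots one at a time (press Enter to advance). A sketch that shows all four at once, using built-in mtcars so it runs without the donors file:

```r
# View all four lm diagnostic plots together in a 2x2 grid
fit <- lm(mpg ~ hp, data = mtcars)  # stands in for our donors model
par(mfrow = c(2, 2))                # 2 rows, 2 columns of plots
plot(fit)
par(mfrow = c(1, 1))                # reset the plotting grid
```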
Another applet: Guess the LSRL Applet
The least-squares regression is the best-fitting line because it ___ the ___ of the squared ___.
\[ d_1^2+d_2^2+d_3^2+d_4^2+ ...+d_n^2= \mbox{a minimum} \]
\[ e_1^2+e_2^2+e_3^2+e_4^2+ ...+e_n^2= \mbox{a minimum} \]
\[ \underset{b_0, b_1}{\text{min}} \mbox{ }\Sigma(residuals)^2 \]
\[ \underset{b_0, b_1}{\text{min}} \mbox{ }\Sigma(y-b_0-b_1x)^2 \]
\[ b = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \]
To derive these formulas, we need multivariate calculus and linear algebra.
\(\hat{y} = a + b(x)\)
The centroid \((\bar{x}, \bar{y})\) always lies on the LSRL.
\[ a = \bar{y} - b(\bar{x}) \]
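We can check these closed-form formulas numerically against lm(). A sketch with the built-in mtcars data (rather than the donors sample, so it runs anywhere):

```r
# Slope and intercept from the formulas, checked against lm()
x <- mtcars$hp
y <- mtcars$mpg
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)
c(intercept = a, slope = b)
coef(lm(mpg ~ hp, data = mtcars))  # matches the hand-computed values
# The centroid lies on the LSRL: a + b * mean(x) equals mean(y)
a + b * mean(x) - mean(y)          # zero, up to floating-point rounding
```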
# Note different syntax with data = ...
model <- lm(totalGivingAmount ~ largestGiftAmount, data = sampleDonors)
# totalGivingAmount is the response (y); largestGiftAmount is the explanatory variable (x)
model
Call:
lm(formula = totalGivingAmount ~ largestGiftAmount, data = sampleDonors)
Coefficients:
(Intercept) largestGiftAmount
51.82 3.91
summary(model)
Call:
lm(formula = totalGivingAmount ~ largestGiftAmount, data = sampleDonors)
Residuals:
Min 1Q Median 3Q Max
-448.26 -62.28 -33.37 41.97 478.22
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.817 27.292 1.899 0.0636 .
largestGiftAmount 3.910 0.782 5.000 8.07e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 138.8 on 48 degrees of freedom
Multiple R-squared: 0.3424, Adjusted R-squared: 0.3287
F-statistic: 25 on 1 and 48 DF, p-value: 8.075e-06
Identify the slope and the y-intercept in the regression output above.
model <- lm(totalGivingAmount ~ largestGiftAmount, data = sampleDonors)
model
### NOTE:
### model <- lm(sampleDonors$totalGivingAmount ~ sampleDonors$largestGiftAmount)
### also works here but creates an issue with predict() below
# Store the coefficients on the linear model
my_intercept <- round(model$coefficients[1], 1)
print("Intercept")
[1] "Intercept"
my_slope <- round(model$coefficients[2], 1)
print("Slope")
[1] "Slope"
# Plot scatterplot
plot(sampleDonors$largestGiftAmount,
sampleDonors$totalGivingAmount,
cex = 1.5,
las = 1,
pch = 16,
xlab = "Largest Gift Amount",
ylab = "Total Giving",
main = paste("LSRL: ", "Predicted y = ", my_intercept, "+", my_slope, "(x)"))
# Plot LSRL
abline(model,
lwd = 2, # make the line wider
col = "red" # color the line
)
The slope is a rate of change.
\[ slope = \frac{rise}{run} = \frac{\Delta y}{\Delta x} = \frac{y_2-y_1}{x_2-x_1} = \frac{slope}{1} \]
To interpret the slope:
\[ \text{y-intercept: } (0, a) \]
To interpret the y-intercept:
Interpret the slope and y-intercept of our linear model:
model$coefficients
(Intercept) largestGiftAmount
51.816908 3.909609
Slope: We predict the total giving to increase $3.91 for each additional $1 increase in largest gift amount.
Y-intercept: When a donor’s largest gift amount is $0, the predicted total giving is $51.82.
NOT THIS: “The total giving will increase $3.91 for
every $1 increase in largest gift amount”
“We predict a resting heart rate of -57 bpm for a body weight of 0 pounds.”
Making a prediction outside the domain of our explanatory variable.
Use the LSRL to predict the total giving of a donor whose largest gift was $5,000.
model
Call:
lm(formula = totalGivingAmount ~ largestGiftAmount, data = sampleDonors)
Coefficients:
(Intercept) largestGiftAmount
51.82 3.91
51.82 + 3.91*5000
[1] 19601.82
predict(model, data.frame(largestGiftAmount = 5000))
1
19599.86
plot(sampleDonors$largestGiftAmount,
sampleDonors$totalGivingAmount,
cex = 1.5,
las = 1,
pch = 16,
xlab = "Largest Gift Amount",
ylab = "Total Giving",
main = paste("What is extrapolation?"))
# Plot LSRL
abline(model,
lwd = 2, # make the line wider
col = "red" # color the line
)
position <- 1:10
rate <- c(23.53, 14.94, 11.19, 7.47, 5.29, 3.8, 2.79, 2.11, 1.57, 1.18)
# Create scatterplot
plot(position, # variable on x-axis
rate, # variable on y-axis
pch = 16, # filled-in circles
cex = 2, # larger circle size
las = 1, # rotate numbers on y-axis
xlab = "Position", # x-axis label
ylab = "Click-through Rate") # y-axis label
abline(lm(rate ~ position), col = "red", lwd = 2)
# 1) Find the linear model for position vs. rate
rate_model <- lm(rate ~ position)
rate_model
Call:
lm(formula = rate ~ position)
Coefficients:
(Intercept) position
19.243 -2.156
# 2) Construct a residual plot (use position vs. residuals).
rate_model$residuals
1 2 3 4 5 6 7 8 9 10
6.442909091 0.008484848 -1.585939394 -3.150363636 -3.174787879 -2.509212121 -1.363636364 0.111939394 1.727515152 3.493090909
plot(position, rate_model$residuals, pch = 16, las = 1, xlab = "Position", ylab = "Residual")
# 3) Include a horizontal line at residual = 0.
abline(h = 0, col = "red")
If the residual plot shows a curved pattern or a fanning out/in shape, the model was not appropriate.
If the residual plot shows random scatter, the model was appropriate.
The model should explain the pattern; if there’s a left-over pattern, try a different model (e.g., quadratic)
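For instance, the click-through data above left a curved residual pattern under the linear model. A sketch of a quadratic fit for comparison (quad_model is a name introduced here, not from the notes above):

```r
# Refit the click-through data with a squared term and re-check the residuals
position <- 1:10
rate <- c(23.53, 14.94, 11.19, 7.47, 5.29, 3.8, 2.79, 2.11, 1.57, 1.18)
quad_model <- lm(rate ~ position + I(position^2))
plot(position, quad_model$residuals, pch = 16, las = 1,
     xlab = "Position", ylab = "Residual", main = "Residuals: Quadratic Fit")
abline(h = 0, col = "red")
summary(quad_model)$r.squared  # compare with the linear model's r-squared
```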
model <- lm(totalGivingAmount ~ largestGiftAmount, data = sampleDonors)
plot(sampleDonors$largestGiftAmount,
model$residuals,
xlab = "Largest Gift Amount",
ylab = "Residual",
las = 1,
pch = 16,
cex = 1.5)
abline(h = 0, col = "red", lwd = 2)
Is a linear model an appropriate model for the relationship between age and largest gift amount?
plot(sampleDonors$age, sampleDonors$largestGiftAmount)
# Create linear model
model_age <- lm(sampleDonors$largestGiftAmount ~ sampleDonors$age)
# Plot model diagnostics
plot(model_age)
For more on residual plots, see LSR.
Residual plots: did I pick an appropriate model?
R-Squared: does the model do a good job?
For more on r-squared, see LSR.
# Have both plots appear in one viewing window (1 row, 2 columns)
par(mfrow = c(1, 2))
# Plot 1
plot(sampleDonors$largestGiftAmount,
sampleDonors$totalGivingAmount,
cex = 1,
las = 1,
pch = 16,
xlab = "Largest Gift Amount",
ylab = "Total Giving",
main = "Should we use the mean?")
# Plot mean response
abline(h = mean(sampleDonors$totalGivingAmount), lwd = 2, col = "blue")
# Plot 2
plot(sampleDonors$largestGiftAmount,
sampleDonors$totalGivingAmount,
cex = 1,
las = 1,
pch = 16,
xlab = "Largest Gift Amount",
ylab = "Total Giving",
main = "Should we use the model?")
# Plot LSRL
abline(model,
lwd = 2, # make the line wider
col = "red" # color the line
)
Which model should we use to make predictions: the mean of the response or the LSRL?
# View response variable values
sampleDonors$totalGivingAmount
[1] 38.0 100.0 49.0 93.0 921.0 100.0 24.0 105.0 75.0 269.5 163.0 66.0 93.0 15.0 163.0 130.0 25.0 26.0 81.0 47.0 62.0 120.0 95.0 71.0
[25] 32.0 263.0 275.5 83.0 211.0 290.0 216.0 503.9 144.0 665.0 55.0 56.0 56.0 146.0 365.0 37.0 74.0 313.0 25.0 32.0 123.0 190.0 80.0 102.0
[49] 31.0 29.0
# Find deviation from point to the mean response
totGivingDev <- sampleDonors$totalGivingAmount - mean(sampleDonors$totalGivingAmount)
totGivingDev
[1] -108.578 -46.578 -97.578 -53.578 774.422 -46.578 -122.578 -41.578 -71.578 122.922 16.422 -80.578 -53.578 -131.578 16.422 -16.578
[17] -121.578 -120.578 -65.578 -99.578 -84.578 -26.578 -51.578 -75.578 -114.578 116.422 128.922 -63.578 64.422 143.422 69.422 357.322
[33] -2.578 518.422 -91.578 -90.578 -90.578 -0.578 218.422 -109.578 -72.578 166.422 -121.578 -114.578 -23.578 43.422 -66.578 -44.578
[49] -115.578 -117.578
# Square the deviations
totGivingDev^2
[1] 11789.182084 2169.510084 9521.466084 2870.602084 599729.434084 2169.510084 15025.366084 1728.730084 5123.410084 15109.818084
[11] 269.682084 6492.814084 2870.602084 17312.770084 269.682084 274.830084 14781.210084 14539.054084 4300.474084 9915.778084
[21] 7153.438084 706.390084 2660.290084 5712.034084 13128.118084 13554.082084 16620.882084 4042.162084 4150.194084 20569.870084
[31] 4819.414084 127679.011684 6.646084 268761.370084 8386.530084 8204.374084 8204.374084 0.334084 47708.170084 12007.338084
[41] 5267.566084 27696.282084 14781.210084 13128.118084 555.922084 1885.470084 4432.630084 1987.198084 13358.274084 13824.586084
# Calculate total sum of squares
sst <- sum(totGivingDev^2)
sst
[1] 1407256
# Find deviation from point to LSRL
model$residuals
24388 59521 43307 69586 11571 25173 32618 13903 8229 25305
-41.18416812 -147.29733501 -41.91299372 -107.38203262 478.22223837 -30.00907904 -66.91299372 14.08700628 -58.91868757 139.49092096
90597 22306 12204 43809 72611 92490 36244 45399 81580 6519
-45.20124969 -64.00907904 2.08700628 -95.46103638 -25.65320703 -0.00907904 -124.55712170 -88.37064491 9.63504895 -83.00907904
92199 19242 87320 82446 21875 58472 91095 62956 21323 13284
-36.73221078 -29.55712170 23.63504895 -78.55712170 -58.91299372 132.99092096 165.03896362 -27.46103638 61.44287830 140.44287830
7976 9392 3863 52253 26876 88684 13973 31334 39241 47959
97.71974656 296.08971117 45.26778922 300.41440901 -43.73221078 -62.28025344 -62.28025344 58.99661482 246.71974656 -73.46103638
28278 66394 72299 11367 95283 13602 5051 16920 29660 56659
-1.27455959 65.70266499 -124.55712170 -51.09377665 32.08700628 -448.25818826 -50.00907904 -20.18986197 -59.91299372 -61.91299372
# Square the deviations
(model$residuals)^2
24388 59521 43307 69586 11571 25173 32618 13903
1696.13570357291 21696.50490175439 1756.69904217855 11530.90093030444 228696.50926800512 900.54482480329 4477.34872794341 198.44374606526
8229 25305 90597 22306 12204 43809 72611 92490
3471.41174520692 19457.71703038830 2043.15297336797 4097.16219949421 4.35559523239 9112.80946625392 658.08703071803 0.00008242896
36244 45399 81580 6519 92199 19242 87320 82446
15514.47656661454 7809.37088178496 92.83416820801 6890.50720299795 1349.25530879753 873.62344328968 558.61553871954 6171.22137005724
21875 58472 91095 62956 21323 13284 7976 9392
3470.74082849865 17686.58505790274 27237.85951361042 754.10851892279 3775.22729357851 19724.20206470836 9549.14886730312 87669.11705784447
3863 52253 26876 88684 13973 31334 39241 47959
2049.17274085257 90248.81714369814 1912.50625971966 3878.82996883584 3878.82996883584 3480.60055988084 60870.63334150079 5396.52386564679
28278 66394 72299 11367 95283 13602 5051 16920
1.62450213721 4316.84018647315 15514.47656661454 2610.57401240528 1029.57597231456 200935.40333778711 2500.90798638618 407.63052655813
29660 56659
3589.56681592925 3833.21879079043
# Calculate sum of squares when using the model
sse <- sum((model$residuals)^2)
sse
[1] 925380.4
Note: I used SSE here; SSresid and SSR (sum of squared residuals) are other names for the same quantity. (Careful: in some texts SSR or SSM denotes the regression/model sum of squares, which is a different quantity.)
1 - sse/sst
[1] 0.3424222
# Find the correlation
giving_correlation <- cor(sampleDonors$largestGiftAmount, sampleDonors$totalGivingAmount)
giving_correlation
[1] 0.5851685
# Square the correlation
giving_correlation^2
[1] 0.3424222
An amazing connection:
# Find the correlation
giving_correlation <- cor(sampleDonors$largestGiftAmount, sampleDonors$totalGivingAmount)
giving_correlation
[1] 0.5851685
# Square the correlation
giving_correlation^2
[1] 0.3424222
\[ r^2 = 1-\frac{SSE}{SST} \]
We want \(r^2\) to be as large as possible (i.e., close to 1.00); we can compare \(r^2\) across various models.
\(r^2\): “the coefficient of determination”
Measures the percent reduction in the sum of squared residuals when using the LSRL to make predictions, rather than using the mean value of y (the response variable)
Measures the proportion (or percentage) of the variation in the response variable that is accounted for by the LSRL using the explanatory variable
# Regression output provides R-squared
summary(model)
Interpret \(r^2 = 0.34\) in this context.
The LSRL using largest gift amount accounts for 34% of the variability in total giving amount.
34% of the variability in total giving amount is accounted for by the LSRL using largest gift amount.
The linear model using [explanatory variable] accounts for \(r^2\) of the variability in [response variable].
position <- 1:10
rate <- c(23.53, 14.94, 11.19, 7.47, 5.29, 3.8, 2.79, 2.11, 1.57, 1.18)
par(mfrow = c(1, 2))
plot(position,
rate,
pch = 16,
cex = 2,
las = 1,
xlab = "Position",
ylab = "Click-through Rate",
main = "Should we use the mean?")
abline(h = mean(rate), lwd = 2, col = "blue")
plot(position,
rate,
pch = 16,
cex = 2,
las = 1,
xlab = "Position",
ylab = "Click-through Rate",
main = "Should we use the model?")
abline(lm(rate~position), lwd = 2, col = "red")
cor(position, rate)^2
Interpret \(r^2=0.8144\) in context.
Use a statistic from a sample to learn about an unknown parameter from a population.
Summary of some parameters and statistics in this course:
| | Data Type | Parameter | Statistic |
|---|---|---|---|
| Means | Quantitative | \(\mu\) | \(\bar{x}\) |
| Difference in Means | Quantitative | \(\mu_1-\mu_2\) | \(\bar{x}_1-\bar{x}_2\) |
| Proportion | Categorical | \(p\) | \(\widehat{p}\) |
| Difference in Proportions | Categorical | \(p_1-p_2\) | \(\widehat{p}_1-\widehat{p}_2\) |
| Slope | Quantitative | \(\beta\) | \(b\) or \(b_1\) or \(\widehat{\beta}_1\) |
Confidence Interval: We would like to estimate the true population slope \(\beta\) between two variables using our sample slope \(b\).
# I am not setting a random seed here, so your results will differ from mine.
# Pick 4 different random samples
sampleIDs1 <- sample(1:95412, size = 50, replace = FALSE)
sampleIDs2 <- sample(1:95412, size = 50, replace = FALSE)
sampleIDs3 <- sample(1:95412, size = 50, replace = FALSE)
sampleIDs4 <- sample(1:95412, size = 50, replace = FALSE)
# Select the columns I'm interested in
myColumns <- c("age", "incomeRating", "totalGivingAmount", "largestGiftAmount", "isHomeowner", "gender", "urbanicity", "respondedMailing")
# Store the randomly selected rows on the given columns
sampleDonors1 <- donors[sampleIDs1, myColumns]
sampleDonors2 <- donors[sampleIDs2, myColumns]
sampleDonors3 <- donors[sampleIDs3, myColumns]
sampleDonors4 <- donors[sampleIDs4, myColumns]
# Calculate linear model for each sample
mod1 <- lm(sampleDonors1$totalGivingAmount~sampleDonors1$largestGiftAmount)
mod2 <- lm(sampleDonors2$totalGivingAmount~sampleDonors2$largestGiftAmount)
mod3 <- lm(sampleDonors3$totalGivingAmount~sampleDonors3$largestGiftAmount)
mod4 <- lm(sampleDonors4$totalGivingAmount~sampleDonors4$largestGiftAmount)
# Store the slope from each model
mod1slope <- round(mod1$coefficients[2], 1)
mod2slope <- round(mod2$coefficients[2], 1)
mod3slope <- round(mod3$coefficients[2], 1)
mod4slope <- round(mod4$coefficients[2], 1)
# Store r-squared for each model
mod1corr <- round(sqrt(summary(mod1)$r.squared), 2)
mod2corr <- round(sqrt(summary(mod2)$r.squared), 2)
mod3corr <- round(sqrt(summary(mod3)$r.squared), 2)
mod4corr <- round(sqrt(summary(mod4)$r.squared), 2)
# 2x2 viewing window
par(mfrow = c(2, 2))
# Plot 4 scatterplots with LSRL to illustrate sampling variability
plot(sampleDonors1$largestGiftAmount, sampleDonors1$totalGivingAmount, cex = 1.5, las = 1, pch = 16, xlab = "Largest Gift", ylab = "Total Giving", main = paste("b =", mod1slope, "and r =", mod1corr))
abline(mod1, col = "red", lwd = 1.5)
plot(sampleDonors2$largestGiftAmount, sampleDonors2$totalGivingAmount, cex = 1.5, las = 1, pch = 16, xlab = "Largest Gift", ylab = "Total Giving", main = paste("b =", mod2slope, "and r =", mod2corr))
abline(mod2, col = "red", lwd = 1.5)
plot(sampleDonors3$largestGiftAmount, sampleDonors3$totalGivingAmount, cex = 1.5, las = 1, pch = 16, xlab = "Largest Gift", ylab = "Total Giving", main = paste("b =", mod3slope, "and r =", mod3corr))
abline(mod3, col = "red", lwd = 1.5)
plot(sampleDonors4$largestGiftAmount, sampleDonors4$totalGivingAmount, cex = 1.5, las = 1, pch = 16, xlab = "Largest Gift", ylab = "Total Giving", main = paste("b =", mod4slope, "and r =", mod4corr))
abline(mod4, col = "red", lwd = 1.5)
# Note to self: making a function would have been wise here!
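A sketch of the helper that note suggests (plot_sample_fit is a hypothetical name; it assumes a data frame containing the two named columns):

```r
# Draw a random sample of rows, fit y on x, plot it with the LSRL, return the model
plot_sample_fit <- function(df, xvar, yvar, n = 50) {
  s <- df[sample(nrow(df), n), ]                       # random sample of n rows
  fit <- lm(reformulate(xvar, response = yvar), data = s)
  plot(s[[xvar]], s[[yvar]], cex = 1.5, las = 1, pch = 16,
       xlab = xvar, ylab = yvar,
       main = paste("b =", round(coef(fit)[2], 1),
                    "and r =", round(sqrt(summary(fit)$r.squared), 2)))
  abline(fit, col = "red", lwd = 1.5)
  invisible(fit)
}
# Usage with the donors data:
# mod1 <- plot_sample_fit(donors, "largestGiftAmount", "totalGivingAmount")
```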
Estimate \(\pm\) Margin of Error
CI for a population mean: \[\bar{x} \pm t^*_{df}(SEM)\]
CI for a population slope: \[ b \pm t^*_{df}(SE_b) \]
where \(b\) is the sample slope, \(t^*\) follows a t-distribution with df = n - 2, and \(SE_b\) is the standard error of the slope.
plot(sampleDonors$largestGiftAmount,
sampleDonors$totalGivingAmount,
cex = 1,
las = 1,
pch = 16,
xlab = "Largest Gift Amount",
ylab = "Total Giving")
# Plot LSRL
abline(model,
lwd = 2, # make the line wider
col = "red" # color the line
)
# t critical value
tcrit <- qt(p = 0.975, df = 50-2)
tcrit
[1] 2.010635
# Lower bound
3.910 - tcrit*0.782
[1] 2.337684
# Upper bound
3.910 + tcrit*0.782
[1] 5.482316
Use confint(model, level = ...) to verify your answer.
plot(sampleDonors$largestGiftAmount,
sampleDonors$totalGivingAmount,
pch = 16,
las = 1,
xlab = "Largest Gift Amount",
ylab = "Total Giving")
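A sketch of the confint() check with built-in data (mtcars stands in for the donors sample, since the by-hand interval above needs the donors model):

```r
# confint() reproduces the by-hand t-interval for the slope
fit <- lm(mpg ~ hp, data = mtcars)
confint(fit, level = 0.95)
# Manual check for the slope row:
b  <- coef(fit)[2]                          # sample slope
se <- summary(fit)$coefficients[2, 2]       # standard error of the slope
b + c(-1, 1) * qt(0.975, df = nrow(mtcars) - 2) * se
```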
# Calculate population regression line
popLine <- lm(donors$totalGivingAmount ~ donors$largestGiftAmount)
# Plot all donors
plot(donors$largestGiftAmount,
donors$totalGivingAmount,
cex = 1,
las = 1,
pch = 16,
xlab = "Largest Gift Amount",
ylab = "Total Giving",
main = paste("Population Slope = ", round(popLine$coefficients[2], 2)))
# Plot Population regression line
abline(popLine,
lwd = 2, # make the line wider
col = "black" # color the line
)
# Plot LSRL from sample model
abline(model,
lwd = 2, # make the line wider
col = "red" # color the line
)
# Add a legend
legend("bottomright",
legend = c("Population LSRL", "Sample LSRL"),
col = c("black", "red"),
pch = c(15, 15))
We are 95% confident that the interval (2.337, 5.482) captures the population slope of total giving on largest gift amount for all the donors in this organization.
The population slope was 2.64, so yes, our interval captured the parameter.
We took a sample of donors from the population of all donors.
# For reproducibility
set.seed(1)
# Randomly select 50 numbers to represent rows in dataframe
sampleIDs <- sample(1:95412, size = 50, replace = FALSE)
# Select the columns I'm interested in
myColumns <- c("age", "incomeRating", "totalGivingAmount", "largestGiftAmount", "isHomeowner", "gender", "urbanicity", "respondedMailing")
# Store the randomly selected rows on the given columns
sampleDonors <- donors[sampleIDs, myColumns]
sampleDonors
We would like to see if the sample slope is statistically significant (i.e., a __ P-value).
Goal for our models: predictors with low p-values and a model with a high \(r^2\)
\(H_0: \beta = 0\)
\(H_a: \beta \ne 0\)
where \(\beta\) = the slope of the population regression line of total giving on largest gift for all donors in this organization
summary(model)
Call:
lm(formula = totalGivingAmount ~ largestGiftAmount, data = sampleDonors)
Residuals:
Min 1Q Median 3Q Max
-448.26 -62.28 -33.37 41.97 478.22
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.817 27.292 1.899 0.0636 .
largestGiftAmount 3.910 0.782 5.000 8.07e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 138.8 on 48 degrees of freedom
Multiple R-squared: 0.3424, Adjusted R-squared: 0.3287
F-statistic: 25 on 1 and 48 DF, p-value: 8.075e-06
Since the p-value is less than the alpha level of 0.05, we have convincing evidence that the slope of the population regression line using largest gift to predict total giving is not zero.
In other words, there is convincing evidence of an association between largest gift amount and total giving for all donors in this organization.
\[ t = \frac{b - \beta}{SE_b} \]
# Test statistic
(3.910 - 0) / 0.782
[1] 5
# P-value
2 * pt(q = 5, df = 50 - 2, lower.tail = FALSE)
[1] 8.061665e-06
For more on significance tests in regression, see LSR.
Just because a variable is statistically significant does not mean it is practically significant!
Collect two quantitative variables on the same individuals and calculate a confidence interval and perform a significance test for the slope.
model_largest <- lm(totalGivingAmount ~ largestGiftAmount, data = sampleDonors)
summary(model_largest)
Call:
lm(formula = totalGivingAmount ~ largestGiftAmount, data = sampleDonors)
Residuals:
Min 1Q Median 3Q Max
-448.26 -62.28 -33.37 41.97 478.22
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.817 27.292 1.899 0.0636 .
largestGiftAmount 3.910 0.782 5.000 0.00000807 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 138.8 on 48 degrees of freedom
Multiple R-squared: 0.3424, Adjusted R-squared: 0.3287
F-statistic: 25 on 1 and 48 DF, p-value: 0.000008075
predict(model_largest, data.frame(largestGiftAmount = 50))
1
247.2973
predict(model, dataframe)
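predict() can also attach an uncertainty interval to a point estimate via its interval argument. A sketch on the same model:

```r
# Confidence interval for the mean total giving when largest gift = 50
predict(model_largest, data.frame(largestGiftAmount = 50),
        interval = "confidence", level = 0.95)
# interval = "prediction" would give the (wider) interval
# for an individual donor rather than the mean response
```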
model_gender <- lm(totalGivingAmount ~ gender, data = sampleDonors)
table(sampleDonors$gender)
female male
27 22
summary(model_gender)
Call:
lm(formula = totalGivingAmount ~ gender, data = sampleDonors)
Residuals:
Min 1Q Median 3Q Max
-146.25 -106.25 -56.17 23.83 759.75
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 139.17 33.05 4.211 0.000114 ***
gendermale 22.08 49.32 0.448 0.656465
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 171.7 on 47 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.004245, Adjusted R-squared: -0.01694
F-statistic: 0.2004 on 1 and 47 DF, p-value: 0.6565
\[ \hat{y} = 139.17 + 22.08(male) \]
where male = 1 if the donor is male and 0 if not
predict(model_gender, data.frame(gender = "male"))
1
161.2455
139.17 + 22.08*1
[1] 161.25
predict(model_gender, data.frame(gender = c("male", "female")))
1 2
161.2455 139.1667
Notice 139.17 + 22.08(0) = 139.17, the y-intercept
If a donor in this sample is male, we predict total giving to be $22.08 more than if the donor is female.
Note, however, that the gender coefficient is not statistically significant (P-value = 0.656), which means the difference we observe between male and female donors may be due to chance rather than an actual difference in giving.
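Under the hood, R converts the factor into a 0/1 indicator column. One way to peek at the coding (a sketch; the rows shown depend on which sampled donors have non-missing gender):

```r
# Design matrix for the gender model: an intercept column plus a
# 0/1 "gendermale" indicator (female is the reference level)
head(model.matrix(model_gender))
```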
table(sampleDonors$urbanicity)
city rural suburb town urban
9 14 7 8 11
model_urbanicity <- lm(totalGivingAmount ~ urbanicity, data = sampleDonors)
model_urbanicity
Call:
lm(formula = totalGivingAmount ~ urbanicity, data = sampleDonors)
Coefficients:
(Intercept) urbanicityrural urbanicitysuburb urbanicitytown urbanicityurban
199.28 -113.63 18.72 -84.47 -60.47
\[ \hat{y} = 199.28 - 113.63(rural) + 18.72(suburb) - 84.47(town) - 60.47(urban) \]
a. Predict the total giving for a donor in this sample who lives in a rural area.
199.28 - 113.63*(1)
[1] 85.65
b. Predict the total giving for a donor in this sample who lives in the suburbs.
199.28 + 18.72*(1)
[1] 218
c. Predict the total giving for a donor in this sample who lives in a city (wait…).
199.28 + 0 + 0 + 0 + 0
[1] 199.28
City is the “reference group”: it has no indicator variable of its own, so all four indicators are 0 and the prediction is just the intercept.
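R picks the reference group alphabetically (“city” sorts first). If a different baseline is preferred, one approach (a sketch, assuming urbanicity is stored as a factor or character column) is relevel():

```r
# Refit with "suburb" as the reference category instead of "city"
sampleDonors$urbanicity <- relevel(factor(sampleDonors$urbanicity), ref = "suburb")
model_urbanicity2 <- lm(totalGivingAmount ~ urbanicity, data = sampleDonors)
model_urbanicity2
```

The fitted predictions are unchanged; only which group the intercept represents (and which differences the coefficients measure) changes.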
Consider the data below from a recent year on six-year graduation rate (%), instructional expenditure per full-time student (in dollars), and median SAT score for 9 primarily undergraduate public universities and colleges in the western United States with enrollments between 10,000 and 20,000 (from Peck et al., Stats: Learning from Data, 2nd ed., p. 234).
rate <- c(75, 71.5, 59.3, 56.4, 52.4, 48, 45.8, 42.7, 41.1)
expend <- c(6960, 7274, 5361, 5374, 5070, 5226, 5927, 5600, 5073)
SAT <- c(1242, 1114, 1014, 1070, 920, 888, 970, 937, 871)
colleges <- data.frame(rate, expend, SAT)
colleges
# Create two models, predicting graduation rate with different predictors
model_expend <- lm(rate ~ expend)
model_SAT <- lm(rate ~ SAT)
# Output the summary of each model
summary(model_expend)
Call:
lm(formula = rate ~ expend)
Residuals:
Min 1Q Median 3Q Max
-10.8078 -5.5289 -0.4167 6.2539 9.3058
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.648243 20.183306 -0.627 0.551
expend 0.011685 0.003472 3.366 0.012 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.012 on 7 degrees of freedom
Multiple R-squared: 0.6181, Adjusted R-squared: 0.5635
F-statistic: 11.33 on 1 and 7 DF, p-value: 0.01199
summary(model_SAT)
Call:
lm(formula = rate ~ SAT)
Residuals:
Min 1Q Median 3Q Max
-5.952 -4.438 -1.505 3.838 6.631
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -37.20052 15.54075 -2.394 0.04790 *
SAT 0.09162 0.01540 5.951 0.00057 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.267 on 7 degrees of freedom
Multiple R-squared: 0.835, Adjusted R-squared: 0.8114
F-statistic: 35.41 on 1 and 7 DF, p-value: 0.0005695
# Print r-squared
summary(model_expend)$r.squared
[1] 0.6180944
summary(model_SAT)$r.squared
[1] 0.8349573
# Print the p-value for the slope of each model
# (it sits in the 2nd row, 4th column of the coefficients table)
summary(model_expend)$coefficients[2, 4]
[1] 0.01198762
summary(model_SAT)$coefficients[2, 4]
[1] 0.0005695496
Choice: the SAT model is the better single-predictor model here; it has the higher \(r^2\) (0.835 vs. 0.618) and the smaller p-value for its slope.
Linear Regression Model with One Predictor
\[ \widehat{y} = b_0 + b_1 x_1 \]
Multiple Regression Model with p predictors:
\[ \widehat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p \]
To assess a multiple regression model, we use adjusted \(r^2\), which penalizes a model that uses too many useless predictors.
In other words, adjusted \(r^2\) will only increase if the new variables improve the model’s performance more than you’d expect by chance. For more info, see LSR.
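The adjustment can be computed by hand from \(r^2\), the sample size \(n\), and the number of predictors \(p\):

\[ r^2_{adj} = 1 - (1 - r^2)\,\frac{n-1}{n-p-1} \]

A quick check in R using the colleges multiple regression (\(r^2 = 0.8413\), \(n = 9\) colleges, \(p = 2\) predictors):

```r
# Adjusted r-squared by hand: 1 - (1 - r^2) * (n - 1) / (n - p - 1)
1 - (1 - 0.8413) * (9 - 1) / (9 - 2 - 1)
# [1] 0.7884  -- matches the Adjusted R-squared reported by summary()
```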
lm(y ~ x1)
lm(y ~ x1 + x2 + x3 + …)
# Create multiple regression
model_mult <- lm(rate ~ expend + SAT)
summary(model_mult)
Call:
lm(formula = rate ~ expend + SAT)
Residuals:
Min 1Q Median 3Q Max
-6.567 -2.919 -1.379 4.523 5.789
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -37.777476 16.503961 -2.289 0.0620 .
expend 0.002011 0.004115 0.489 0.6425
SAT 0.080646 0.027766 2.905 0.0272 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.579 on 6 degrees of freedom
Multiple R-squared: 0.8413, Adjusted R-squared: 0.7884
F-statistic: 15.9 on 2 and 6 DF, p-value: 0.003999
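Once fit, the multiple regression predicts from both inputs at once. A sketch (the expenditure and SAT values below are made up purely for illustration):

```r
# Predicted graduation rate for a hypothetical college
# spending $6000 per student with a median SAT of 1000
predict(model_mult, data.frame(expend = 6000, SAT = 1000))
```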
Comments
R-squared increased slightly to 0.8413 (up from 0.835)
Adjusted R-squared decreased to 0.7884 (down from 0.8114)
The p-value is higher now for SAT.
The p-value is not significant for expenditure.
SAT coefficient = 0.081: holding expenditure constant, each additional point of median SAT predicts a 0.081 percentage-point increase in six-year graduation rate.
Using more realistic units: a 100-point increase in median SAT predicts about an 8.1 percentage-point increase in graduation rate, holding expenditure constant.
Create your own multiple regression model (or models!) for data that interests you. Put it in your portfolio!
This material is for enrolled students’ academic use only and protected under U.S. Copyright Laws. This content must not be shared outside the confines of this course, in line with Eastern University’s academic integrity policies. Unauthorized reproduction, distribution, or transmission of this material, including but not limited to posting on third-party platforms like GitHub, is strictly prohibited and may lead to disciplinary action. You may not alter or remove any copyright or other notice from copies of any content taken from BrightSpace or Eastern University’s website. © Copyright Notice 2024, Eastern University - All Rights Reserved